Last week

We learned about working with strings, factors, and dates.

Objectives

Today we’ll learn how to visualize data. Some of today’s examples come from Healy (2017). By the end of this session, you should be able to:

  • Describe the main arguments of a ggplot() call
  • Map plot aesthetics to variables
  • Style plots by color, fill, and shape
  • Facet plots to make small multiples
  • Set the theme and add labels

Log in to Duke’s Docker-ized version of RStudio Server

  • Log in to your instance by going to https://vm-manage.oit.duke.edu/containers and entering your NetID.
  • Click on Docker
  • Click on RStudio
  • When RStudio loads, restart the R session (Ctrl/Cmd+Shift+F10), clear the console (Ctrl/Cmd+L), and clear your workspace

Open your project

Is your project still open? If not, click on the project icon to load it. (Don’t create a new one.)

Start by loading a few packages

We’ll need:

  library(tidyverse)
  library(gapminder)

Know your audience

  • When you are creating plots during data inspection, cleaning, and exploration, you should emphasize speed and utility. Often you have an audience of one: yourself.
  • When you need to share output with co-authors, save yourself time in the long run by adding titles, labels, and minimal notes.
  • When you are ready to share your work externally, expect to spend even more time creating publication-quality graphics. Remember, every figure + caption should be able to stand on its own without supporting details in the text.

Exploratory data analysis

  • EDA is an iterative process of asking questions about your data, searching for answers through data visualization, transformation, and modelling, and then using what you learn to refine your questions (Wickham and Grolemund 2017).
  • EDA starts with data inspection and cleaning, but extends to a substantive exploration of data questions.
  • Wickham and Grolemund (2017) and Healy (2017) both emphasize the importance of looking at your data to really understand what you have.

Look at your data (r=0.6)

Ask

Wickham and Grolemund (2017) suggest the following EDA questions:

  • What type of variation occurs within my variables? (distributions)
  • What type of covariation occurs between my variables? (associations)

Categorical

If you have categorical data, start with a bar chart to summarize the distribution of values.
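
As a quick sketch of this idea in base R, the block below counts the values of one categorical variable and draws a bar per category. It assumes the diamonds dataset that ships with ggplot2 and uses its cut column purely for illustration; the object name cut_counts is made up for this example.

```r
library(ggplot2)  # loaded here only for the diamonds dataset

# Count how many diamonds fall into each cut category
cut_counts <- table(diamonds$cut)
cut_counts

# Draw one bar per category
barplot(cut_counts,
        main = "Diamonds by cut",
        xlab = "Cut",
        ylab = "Count",
        las = 1)
```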

Continuous

If you have continuous data, try a histogram.

A quick plot

Sometimes the fastest plot is a base R plot.

  hist(diamonds$carat)

Some fine tuning

  hist(diamonds$carat, 
       main="Histogram of carat size", 
       xlab="Carat size", 
       border="black", 
       col="red",
       las=1, 
       breaks=10)

The ggplot() way
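
For comparison with the base R histogram above, here is a rough ggplot2 sketch of the same plot. It is an approximation rather than an exact match: bins = 10 asks for exactly ten bins, whereas breaks = 10 in hist() is only a suggestion. The object name h is arbitrary.

```r
library(ggplot2)

# The carat histogram, built as a ggplot object
h <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 10,
                 fill = "red",
                 color = "black") +
  labs(title = "Histogram of carat size",
       x = "Carat size",
       y = "Frequency")
h
```

Each piece of this call (data, aes(), the geom, and labs()) is introduced step by step in the rest of the session.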

Grammar of graphics

  • ggplot2 is a tidyverse package by Hadley Wickham that implements Wilkinson’s Grammar of Graphics, a helpful approach for thinking about the components of an effective visualization of data.
  • In this session we will focus on Wickham’s implementation of this “gg” idea in ggplot2, whose main plotting function is ggplot().
  • For more background on visualization principles and what makes a good plot, see Healy (2017). See also work by William Cleveland, such as The Elements of Graphing Data.

The ggplot way (Healy 2017)

Start by defining the data

This line tells ggplot() which dataset to use and produces a blank plot.

  ggplot(data = gapminder)

Layering

For convenience, we’re going to assign each step to an object called p. You can call it whatever you want. The key idea is that we create a base plot p and add to it in each step. So here, p is just an empty plot. If you want to see the result, you have to print p.

  p <- ggplot(data = gapminder)
  p

Inspect the data

  glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan,...
## $ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Declare data and mapping

The first two ggplot() arguments are data and mapping. Because R matches them by position, we’ll drop the data = and mapping = labels from here on.

  p <- ggplot(data = gapminder,
              mapping = aes(x = gdpPercap,
                            y = lifeExp))
  p <- ggplot(gapminder, 
              aes(x = gdpPercap, 
                  y = lifeExp)) # same thing

The aes() function

  • The mapping argument calls for aesthetic mappings of variables to plot elements.
  • Essentially, with aes() you tell ggplot() which variable from the dataset should map to the x-axis, and which should map to the y-axis.
  • Here, we are mapping two variables from the dataset gapminder: gdpPercap goes to the x-axis, while lifeExp goes to the y-axis.

  p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

What do we have so far?

Not much. We’ve just told ggplot to use the gapminder dataset and to map two variables, but we have not specified the type of plot we want.

  p

Specify a geom()

Use the + sign to add the next layer to this plot: a geom. In this example, we add geom_point(), the points geom.

  p + geom_point()  # not assigning to p on purpose

Fine tune geom_point()

To learn more about a geom’s arguments, check its help file or the reference pages on tidyverse.org: http://ggplot2.tidyverse.org/reference/geom_point.html

  ?geom_point # learn about arguments
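
As one possible illustration of what the help file describes, the sketch below sets a few geom_point() arguments to fixed values. The specific values (and the name p_tuned) are arbitrary choices for this example, not recommendations.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Fixed values go outside aes(), so they apply to every point
p_tuned <- p + geom_point(color = "steelblue",  # one color for all points
                          alpha = 0.4,          # transparency, 0 to 1
                          size = 1.5,           # point size
                          shape = 16)           # solid circle
p_tuned
```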

Pick a different geom

This geom calculates a smoothed line and shades a confidence band around it (a 95% interval by default). Check out the arguments to geom_smooth() to tinker with the smoothing function used.

  p + geom_smooth()
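
A small sketch of that tinkering, assuming we want a loess smoother: span controls how wiggly the loess line is, and se = FALSE turns off the shaded band. The span value and the name p_smooth are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Loess smoother with a narrower span (wigglier line) and no shaded band
p_smooth <- p + geom_smooth(method = "loess",
                            span = 0.3,
                            se = FALSE)
p_smooth
```

To keep the band but change its width, the level argument (0.95 by default) sets the confidence level.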

Add both geoms

  p + geom_point() + geom_smooth(method="lm") # change method

Rescale the x-axis

  p + geom_point() + geom_smooth() + scale_x_log10()

Add some scale labels

  p + geom_point() + geom_smooth() + 
      scale_x_log10(labels = scales::dollar)
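
To see what the labels argument is doing, it can help to call the labelling function directly. The sketch below passes two made-up axis values to scales::dollar(); the name axis_labels is just for this example.

```r
# scales::dollar() takes numbers and returns formatted strings
axis_labels <- scales::dollar(c(1000, 10000))
axis_labels
# → "$1,000" "$10,000"
```

scale_x_log10() applies the same function to its axis breaks, which is why the tick labels show dollar amounts.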

Change the look

  p <- p + geom_point(color="purple",
                      alpha = 0.3, # color transparency
                      size=2) +
           geom_smooth(method="loess", 
                       color="#FCF221") + # htmlcolorcodes.com
           scale_x_log10(labels = scales::dollar)
  p

Add some labels

  p <- p + labs(x = "GDP Per Capita", 
                y = "Life Expectancy in Years",
                title = "Economic Growth and Life Expectancy",
                subtitle = "Data points are country-years",
                caption = "Source: Gapminder.")
  p

Change the theme

  p + theme_minimal()

Map aesthetics to variables

So far we have set color to a fixed value. Instead of making all the points “purple”, we can map color to a variable, so that points are colored by the values of continent.

  p <- ggplot(gapminder,
              aes(x = gdpPercap,
                  y = lifeExp,
                  color = continent))

Adding the geoms

  p + geom_point() +
      geom_smooth(method='loess') +
      scale_x_log10()
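
The difference between mapping an aesthetic and setting it is easy to mix up, so here is a side-by-side sketch. The object names p_mapped and p_set are made up for this example.

```r
library(ggplot2)
library(gapminder)

# Mapping: color sits inside aes(), varies by continent, and gets a legend
p_mapped <- ggplot(gapminder,
                   aes(x = gdpPercap,
                       y = lifeExp,
                       color = continent)) +
  geom_point() +
  scale_x_log10()

# Setting: color sits outside aes(), so every point is purple and there is no legend
p_set <- ggplot(gapminder,
                aes(x = gdpPercap,
                    y = lifeExp)) +
  geom_point(color = "purple") +
  scale_x_log10()

p_mapped
p_set
```

A common slip is writing aes(color = "purple"), which treats "purple" as a one-value variable rather than as a color.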

You can also map a variable to point shape

  ggplot(gapminder,
         aes(x = gdpPercap,
             y = lifeExp,
             shape = continent)) + # changed from color
         geom_point() +
         geom_smooth(method='loess') +
         scale_x_log10()

Map fill to shade the se bands by continent

  p <- ggplot(gapminder,
              aes(x = gdpPercap,
                  y = lifeExp,
                  color = continent,
                  fill = continent))

Adding the geoms

  p + geom_point() +
      geom_smooth(method='loess') +
      scale_x_log10()

Map aesthetics per geom

  p <- ggplot(gapminder,
              aes(x = gdpPercap,
                  y = lifeExp))
  p + geom_point(aes(color = continent),
                 alpha=0.6,
                 size=1) +
      geom_smooth(method='loess') + # just 1 line
      scale_x_log10()

Small multiples

The group trends are hard to see. Let’s try faceting by continent to make a series of “small multiples”. First we rebuild the styled plot from earlier, setting the point and line colors to fixed values:

  p <- p + geom_point(color="purple",
                      alpha = 0.3, 
                      size=2) +
           geom_smooth(method="loess", 
                       color="#FCF221") +
           scale_x_log10(labels = scales::dollar)

facet_wrap()

  p + facet_wrap(~ continent)

Make it nice

  p + facet_wrap(~ continent, ncol = 5) +
      labs(x = "GDP Per Capita", 
           y = "Life Expectancy in Years",
           title = "Economic Growth and Life Expectancy on Five Continents",
           subtitle = "Data points are country-years",
           caption = "Source: Gapminder.") +
      theme_minimal() +
      theme(axis.text.x=element_text(size=6))

References

Healy, Kieran. 2017. Data Visualization for Social Science. http://socviz.co/.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.